Pretty obvious answer of K=2 there.
For K=2, I plotted the proportion of each sample’s genome that was assigned to either K1 or K2:
Given the high levels of missingness in some of the regions, I wanted to make sure that missingness doesn’t predict a sample’s assignment to one of the two populations. The plot below shows the frequency of missing genotypes (y-axis) across samples (x-axis); each sample is colored by the proportion of its genome assigned to K2:
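A minimal sketch of how these two plots could be built with ggplot2. The data here are simulated stand-ins: in the real analysis the K1/K2 columns come from the `.Q` file that ADMIXTURE writes, and `miss` is the per-sample fraction of missing genotypes.

```r
library(ggplot2)
library(reshape2)

# Simulated stand-in for the ADMIXTURE Q matrix at K = 2; real values
# would be read from the .Q output file
set.seed(1)
k2 <- runif(50)
q <- data.frame(sample = 1:50, K1 = 1 - k2, K2 = k2)

# Stacked barplot of each sample's ancestry proportions
p1 <- ggplot(melt(q, id.vars = "sample"),
             aes(x = sample, y = value, fill = variable)) +
  geom_col(width = 1) +
  labs(y = "Ancestry proportion", fill = "Cluster")

# Missingness check: per-sample missing-genotype frequency (simulated here),
# coloured by the proportion of the genome assigned to K2
miss <- runif(50, 0, 0.3)
p2 <- ggplot(data.frame(sample = q$sample, miss = miss, K2 = q$K2),
             aes(x = sample, y = miss, colour = K2)) +
  geom_point() +
  labs(y = "Frequency of missing genotypes")
```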
## [1] 0.01368808
## [1] 0.009160282
## [1] 0.007722963
- Regions of interest on:
  - PC1: two hits on chrm2 and one on chrm11
  - PC2: region upstream of dhps on chrm8 and aat1 on chrm6
  - PC3: dhps, crt
I used the randomForest package to conduct supervised machine learning:
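A sketch of the kind of workflow that produces the summary and test-set confusion matrix below, on simulated stand-in data (in the real analysis each column of the genotype table is a SNP and `loc2` is the North/South label). `ntree = 500` keeps the sketch quick; the reported run used `ntree = 10000`.

```r
library(randomForest)

set.seed(42)
# Simulated stand-in for the SNP genotype table
n <- 120
geno <- as.data.frame(matrix(rbinom(n * 300, 2, 0.3), nrow = n))
geno$loc2 <- factor(ifelse(rowSums(geno[, 1:10]) > 6, "North", "South"))

# Train/test split
idx   <- sample(n, 0.7 * n)
train <- geno[idx, ]
test  <- geno[-idx, ]

# ntree = 500 here for speed; the reported run used ntree = 10000
rf <- randomForest(loc2 ~ ., data = train, importance = TRUE,
                   ntree = 500, mtry = 200, proximity = TRUE)

# Held-out performance: confusion matrix and overall accuracy
pred <- predict(rf, test)
acc  <- mean(pred == test$loc2)
```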
##
## Call:
## randomForest(formula = loc2 ~ ., data = train, importance = TRUE, ntree = 10000, mtry = 200, proximity = TRUE, type = classification)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 200
##
## OOB estimate of error rate: 21.59%
## Confusion matrix:
## North South class.error
## North 120 47 0.2814371
## South 29 156 0.1567568
## predictions.test.data
## North South
## North 37 19
## South 7 55
## [[1]]
## [1] 0.9003456
We can slightly improve this by adjusting the random forest parameters, but it’s nothing to write home about.
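One way to do that adjustment is `randomForest::tuneRF`, which steps `mtry` up and down from a starting value and keeps the setting with the lowest out-of-bag error. The data below are simulated stand-ins for the genotype matrix and labels.

```r
library(randomForest)

set.seed(7)
# Simulated stand-in for the SNP matrix and North/South labels
n <- 100
x <- as.data.frame(matrix(rbinom(n * 300, 2, 0.3), nrow = n))
y <- factor(sample(c("North", "South"), n, replace = TRUE))

# Step mtry up/down by stepFactor, keeping the value with lowest OOB error
res <- tuneRF(x, y, ntreeTry = 200, stepFactor = 2, improve = 0.01,
              trace = FALSE, plot = FALSE)
best_mtry <- res[which.min(res[, "OOBError"]), "mtry"]
```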
Here are the SNPs associated with the north/south divide, as well as some QC:
##
## Call:
## randomForest(formula = pop ~ ., data = train, importance = TRUE, ntree = 10000, mtry = 200, type = classification)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 200
##
## OOB estimate of error rate: 8.32%
## Confusion matrix:
## 0 1 class.error
## 0 141 31 0.18023256
## 1 15 366 0.03937008
## prediction.test.data
## 0 1
## 0 45 12
## 1 7 120
## [[1]]
## [1] 0.8862412
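The SNPs driving the split can be pulled out of a fitted forest with `randomForest::importance` (or visualised with `varImpPlot`). A sketch on simulated data, with SNP-style column names standing in for the real genotype matrix:

```r
library(randomForest)

set.seed(2)
# Simulated SNP matrix with chromosome_position style names, as in the data
n <- 100
x <- as.data.frame(matrix(rbinom(n * 50, 2, 0.3), nrow = n))
names(x) <- paste0("chr1_", seq(1000, by = 1000, length.out = 50))
x$pop <- factor(rbinom(n, 1, 0.5))

rf <- randomForest(pop ~ ., data = x, importance = TRUE, ntree = 500)

# Rank SNPs by mean decrease in accuracy (type = 1); varImpPlot(rf) plots it
imp   <- importance(rf, type = 1)
top10 <- head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)
```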
TO DO: fix seed issue

## Some additional checks
How about treating K as a continuous variable (i.e., using regression instead of classification)?
##
## Call:
## randomForest(formula = K1 ~ ., data = train, importance = TRUE, ntree = 1000, mtry = 200)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 200
##
## Mean of squared residuals: 0.02990619
## % Var explained: 76.14
## tree variable minimal_depth
## 1 1 chr1_155978 14
## 2 1 chr10_116392 10
## 3 1 chr10_1414588 5
## 4 1 chr10_1519673 30
## 5 1 chr10_1554604 4
## 6 1 chr10_377197 12
## 7 1 chr10_424305 11
## 8 1 chr10_669908 8
## 9 1 chr10_699403 13
## 10 1 chr11_1020630 10
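The tree / variable / minimal_depth table above has the shape of `min_depth_distribution()` output from the randomForestExplainer package (that package is an assumption here): one row per (tree, variable) pair, where a low minimal depth means the SNP splits close to the root and is therefore more influential. A sketch on simulated data:

```r
library(randomForest)
library(randomForestExplainer)

set.seed(3)
# Simulated stand-in: regression forest with K1 as a continuous response
n <- 100
x <- as.data.frame(matrix(rnorm(n * 20), nrow = n))
x$K1 <- runif(n)

rf <- randomForest(K1 ~ ., data = x, ntree = 100)

# One row per (tree, variable): depth at which each variable first splits
md <- min_depth_distribution(rf)
```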
##
## Call:
## randomForest(formula = loc2 ~ ., data = train, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 40
##
## OOB estimate of error rate: 16.44%
## Confusion matrix:
## North South class.error
## North 76 27 0.26213592
## South 10 112 0.08196721
## prediction.test.data
## North South
## North 29 5
## South 1 40